
In this section, we present scalability results using two test cases.  All the scalability experiments were performed 
on the {\it{Jaguar}} supercomputer at Oak Ridge National Laboratory (ORNL). The architectural details for this supercomputer
 can be found in \cite{jaguar}. The code is written in the C++ language and built on top of \texttt{PETSc} and \texttt{DENDRO} packages. 
 Specifically, we use \texttt{DENDRO} for managing linear octrees and \texttt{PETSc} for managing distributed regular grids and profiling. 

{\em Example 1.} In this example, we consider a uniform tensor-product grid obtained by using Gaussian quadrature rule to
 compute (\ref{heat}) within the unit cube. The source strength $f(y)$ is chosen randomly and the targets are same as the
  sources. We used the expansion algorithm and incorporated the acceleration techniques
  introduced in \cite{fggt} for tensor-product grids. They are based on {\em separation of variables} and
  the complexity of the S2W and L2T steps is reduced from $\bigO(p^3 N)$ to $\bigO(p N)$. 

We report the weak scalability results in Figure \ref{f:isoUniform}. The grain size is fixed to approximately 30 million 
sources/targets per CPU. We also reduce $\delta$ so that the number of FGT boxes per CPU remains fixed. We get excellent
 scalabilty in this case--only a small increase in timing as we go from 1 to 4096 CPUs. This is  because the cost is dominated 
 by the S2W and L2T steps and both of them are embarassingly parallel in our implementation. There is only one communication 
 step in the algorithm: a point-to-point communication in the W2L step for communicating the plane-wave expansions of ghost boxes.

\begin{figure}
	\begin{center}
	\input{isoUniformPlot}
	\end{center}
\caption{\label{f:isoUniform} Isogranular scalability for an uniform point distribution. For
 this experiment, we set $\epsilon = 10^{-6}$. The reported times for 
each component are the maximum values for that component across all the processors. The total wall-clock
time is reported in bold face.} 
\end{figure}

{\em Example 2.} In this example, we consider a Gaussian random distribution of points in the unit cube. We construct
a linear octree from these points such that each leaf contains $7^3$ sources/targets within it. We used 
 our octree-based FGT algorithm. We present the weak scalability results in Figure \ref{f:isoGaussian}. We varied the
 number of input points such that the number of expand leaves per CPU and the number of direct leaves per CPU remain
  approximately constant as we vary the number of CPUs. There is a significant increase in the timings between the 
  1 CPU and 8 CPU cases because there were no direct leaves for the 1 CPU case. Otherwise, the scalability is very good.

Note that the timings in the nonuniform case have increased considerably compared to the uniform case (Figure \ref{f:isoUniform}). A 
major factor is that the point distribution is random and we cannot use the tensor-product acceleration in the current example. However,
 there are a few other accelerations that we have not incorporated, we list them below.
%
\begin{itemize}
  \item In \cite{fggt}, it is noted that a Hermite expansion is much more effective in condensing the source information than a plane-wave
   expansion. In the S2W step, significant computational speedup can be achieved by first forming the Hermite expansion and then converting 
   it to a plane-wave expansion using the scheme proposed in \cite{fggt}. Similary, in the L2T step, instead of directly evaluating the 
   local plane-wave expansion, it is beneficial to first convert the local plane-wave expansion into a local Taylor expansion and then evaluate 
   the Taylor expansion at the targets. 
  \item In this example, we have set $c = 1$. From Figure \ref{f:isoGaussian}, it is clear that the tree $T_d$ is taking disproportionately 
  more time than $T_e$. There exists an optimal value for the $c$ depending on the free parameters $m$, $\delta$ and $\epsilon$, below which
 the timings are higher because of higher cost for the truncation algorithm and above which, the timings are higher because
 there will be fewer points for FGT box. However, we did not compute the optimal value for this experiment; our choice of $c$ for this experiment
 was rather arbitrary.
\end{itemize}


\begin{figure}
	\begin{center}
	\input{isoGaussianPlot}
	\end{center}
\caption{\label{f:isoGaussian} Isogranular scalability for a gaussian point distribution. For
 this experiment, we set $\epsilon = 10^{-3}$. The reported times for each component are the
 maximum values for that component across all the processors. The total wall-clock
time is reported in bold face.} 
\end{figure}
